Linear Regression - The Causal Interpretation

By Aleksandar Jovanovic

Linear regression admits some neat interpretations that can give us an estimate of the causal effect of one variable on the outcome it models. This article takes a look at why that is so and what conditions are necessary.

The Fundamental Problem of Causal Inference

Causal inference is a set of methods that attempt to estimate a causal effect of one variable, usually called the treatment variable, on another variable, called the outcome variable.

Assume we have data where the treatment $T$ had a certain value $t$. Causal inference boils down to answering the question: "What would have happened if everything was the same as observed, except $T$ had another value $t'$ instead of $t$?" This question is very difficult to answer, and very tricky to differentiate from correlation. Correlation is easy to measure, but it does not answer the question we posed: it just describes the relationship between two variables, without saying anything about which one causes the other or what changing one would do to the other. This is why being able to answer causal questions, rather than merely measuring correlation, is crucial for proper decision making.

The fundamental problem of causal inference can be illustrated by first looking at a hypothetical data set, one we would never be able to have in reality. Consider a medical trial with an outcome variable $y$ and a binary treatment $T$ taking on values of 1 or 0. For example, treatment $1$ could be giving a new medicine to patients, and treatment $0$ could be the placebo.

Now, imagine we give the treatment to all our patients. But we also have a machine that lets us peek into parallel worlds, and we choose to observe the world where everything was exactly the same, except none of the patients got the treatment. Then, combining the data from our own world and the parallel one, we can create a data set that looks like this:

| Outcome 0 | Outcome 1 |
| --------- | --------- |
| $y_1^0$   | $y_1^1$   |
| $y_2^0$   | $y_2^1$   |
| $y_3^0$   | $y_3^1$   |
| ...       | ...       |
| $y_n^0$   | $y_n^1$   |

Estimating a causal effect here is easy: we just take the difference between the two outcomes for each observation and draw conclusions from the distribution of those differences. The average difference is usually of most interest, and, formally, the average causal effect would simply look like this:

$$\frac{1}{n} \sum_{i=1}^n (y_i^1 - y_i^0) = \mathbf{E}[y \mid T = 1] - \mathbf{E}[y \mid T = 0] \tag{1}$$
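To make equation (1) concrete, here's a minimal numpy sketch that computes the average causal effect on a simulated potential-outcomes table; the distributions and the true effect of 1.5 are purely illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical table we could never observe in reality: both potential
# outcomes y^0 and y^1 for every patient (simulated, illustrative values).
n = 1000
y0 = rng.normal(loc=10.0, scale=2.0, size=n)      # outcomes under placebo
y1 = y0 + 1.5 + rng.normal(scale=0.5, size=n)     # outcomes under treatment

# Equation (1): the average causal effect is the mean individual difference.
ace = np.mean(y1 - y0)
print(f"Average causal effect: {ace:.3f}")  # close to the assumed effect of 1.5
```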

In reality, however, we don't have a machine that lets us peek into a parallel world, so we need "tricks" to get around this constraint. One of those "tricks" involves linear regression.

Linear Regression

Linear regression is quite famous, and more has been written about it than about any other model. Still, in the age of big machine learning models with billions of parameters, it often gets overlooked. What linear regression has that complicated models don't is the ability to support useful interpretations. One of those is the causal interpretation, which can drive crucial decision making and often has much more business value than sheer predictive power.

I've written about linear regression and the interpretations it can have in this article, and I recommend reading it, as the rest of this article will assume familiarity with it. Here we'll just give a short summary.

Linear regression is defined by:

  1. Its equation: $\hat{y} = \sum_{i=1}^N \beta_i x_i + \beta_0$
  2. The algorithm that finds the best parameters for the equation, which is least squares in classical linear regression. The goal of least squares is to find the parameters $\beta$ whose predictions have the least Euclidean distance from the real values given the data (see the sketch after this list).
  3. The Interpretation
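As a concrete illustration of point 2, here is a minimal least-squares sketch in numpy; the data and the coefficients are made up for the example:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data from a known linear model (coefficients are assumptions).
n = 500
X = rng.normal(size=(n, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 1.0 + rng.normal(scale=0.5, size=n)

# Least squares: append a column of ones for the intercept beta_0 and solve
# for the parameters minimizing the Euclidean distance to the observed y.
X_design = np.column_stack([X, np.ones(n)])
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print(beta)  # approximately [3.0, -2.0, 1.0]
```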

The interpretation is what we add to the mathematical definition of the model so that, instead of just getting a machine that eats inputs and produces outputs, we get a certain understanding of the model and the data. And if we want to answer causal questions, the interpretation we need when modeling the outcome $y$ is clear if we just look back at equation (1) above: our predictions need to be interpreted as expected values conditioned on the given predictors.

$$\hat{y} = \mathbf{E}[y \mid X] \tag{2}$$
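A quick way to see interpretation (2) in action: with a single binary predictor, the least-squares predictions coincide exactly with the empirical conditional means. The data below are simulated with assumed coefficients:

```python
import numpy as np

rng = np.random.default_rng(7)

# Simulated outcome depending on one binary predictor (assumed setup).
n = 10_000
x = rng.integers(0, 2, size=n).astype(float)
y = 4.0 + 3.0 * x + rng.normal(size=n)

X = np.column_stack([x, np.ones(n)])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# The fitted values reproduce the conditional means E[y | x] exactly.
print(beta[1], y[x == 0].mean())            # prediction at x=0 vs E[y | x=0]
print(beta[0] + beta[1], y[x == 1].mean())  # prediction at x=1 vs E[y | x=1]
```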

So let's see what we need to assume in order for this interpretation to be true.

What Assumptions Are Needed?

In order for least squares to yield a model that outputs expected values of the outcome given the predictors, two baseline assumptions are required, both of which concern the errors of the model, i.e. the residuals:

  1. Residuals of the model are independent and identically distributed (i.i.d.).
  2. Residuals are normally distributed.¹

The i.i.d. assumption means that the error on one observation does not affect the error on another. This usually means that our whole sample must be independent as well, as the easiest way to get dependent errors is to provide dependent input.

Note that if the residuals are i.i.d., then homoscedasticity of the residuals is also satisfied: the variance of the residuals must be the same since their distribution is the same. So if we find, after building a linear model, that the variance of the residuals is not the same across the whole range of the outcome variable, it follows that the residuals are not i.i.d.
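In practice we can at least partially check these assumptions from the fitted residuals. Below is a rough diagnostic sketch, with the model and data assumed for illustration; a Shapiro-Wilk test and a split-variance comparison are just two simple choices among many possible diagnostics:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Fit a small least-squares model on simulated data (assumed setup).
n = 500
X = np.column_stack([rng.normal(size=n), np.ones(n)])
y = 2.0 * X[:, 0] + 1.0 + rng.normal(scale=0.5, size=n)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta

# Normality check: a large p-value gives no evidence against normality.
_, p_value = stats.shapiro(residuals)
print(f"Shapiro-Wilk p-value: {p_value:.3f}")

# Rough homoscedasticity check: residual variance should look similar
# across the lower and upper halves of the fitted values.
fitted = X @ beta
order = np.argsort(fitted)
half = n // 2
print(residuals[order[:half]].var(), residuals[order[half:]].var())
```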

Together with the normality assumption, these are necessary for least squares to yield the parameters $\beta$ that are also the maximum likelihood estimates given the data. This in turn implies the result that we need: that our predictions represent the expected value for the subpopulation described by the predictors.

You are probably wondering why the linearity assumption is nowhere to be found here. Well, that is because it is not really required in order to interpret the outputs as expected values.² However, if the outcome is not a linear function of the predictors in reality, then our errors will necessarily be higher, i.e. the variance of the residuals will be higher.

This has implications for the certainty of the causal effect that we measure, but before we delve into that, let's first see how we actually measure the causal effect using linear regression.

Linear Regression and The Causal Effect

Let's go back to equation (1) and rewrite it here a little differently:

$$ACE = \mathbf{E}[y \mid T = 1, X] - \mathbf{E}[y \mid T = 0, X] \tag{3}$$

$ACE$ stands for the average causal effect that we want to measure. $T$ is the treatment variable, and let's keep assuming it's binary as we did before. The new term is $X$, a set of other variables that we use in the equation. Why is this important?

In causal inference, confounders are variables that affect both the treatment and the outcome. These variables are a big problem, since it's often difficult to isolate the effect of the treatment on the outcome from the effect of the confounder. Finding confounders usually involves domain knowledge, and $X$ here represents the set of potential confounders that we have observed. Let's assume for the moment that all possible confounders are observed and available to us in the matrix $X$. That adds a third assumption to our list, one very specific to causal inference:

  3. All confounders are observed.

By definition, a linear regression model using $y$ as its outcome and $T$ as the treatment, along with the confounders $X$ as predictors, allows us to estimate the following quantity:

$$\mathbf{E}[y \mid T, X] = \beta^\star T + \sum_{i=1}^n \beta_i X_i + \beta_0 \tag{4}$$

This is the familiar equation from before, except that we have separated out the coefficient $\beta^\star$ corresponding to the treatment $T$ for convenience.
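Here is a sketch of fitting equation (4) with statsmodels, on simulated data where both the treatment assignment and the outcome depend on a single confounder; the true causal effect of 2.0 and all the distributions are assumptions made for the example:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)

# Simulated trial with one observed confounder (all numbers are assumptions).
n = 2000
confounder = rng.normal(size=n)
# The confounder influences who gets treated...
treatment = (rng.normal(size=n) + confounder > 0).astype(float)
# ...and also the outcome; the true causal effect of the treatment is 2.0.
y = 2.0 * treatment + 1.5 * confounder + rng.normal(size=n)

# Equation (4): regress y on the treatment and the confounder.
X = sm.add_constant(np.column_stack([treatment, confounder]))
fit = sm.OLS(y, X).fit()
print(fit.params[1])  # beta*, close to the assumed causal effect of 2.0
```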

Now, let's go back to the question we posed back when we talked about the fundamental problem of causal inference:

What would have happened if everything was the same as observed, except $T$ had another value $t'$ instead of $t$?

If we trust our linear model, the causal question can be rephrased as: what outcome would we get if we changed $T$ but held everything else fixed? Consider an individual outcome (i.e. a single observation) indexed by $j$; this means we can treat all the $X_i^j$ as constants. We can thus define a constant $c^j$ for each observation:

$$c^j = \sum_{i=1}^n \beta_i X_i^j + \beta_0 \tag{5}$$

which then leads us to a linear model for each individual observation if we substitute (5) into equation (4):

$$\mathbf{E}[y^j \mid T^j] = \beta^\star T^j + c^j \tag{6}$$

It is as if every observation has its own private linear model.

Now, what we want to calculate is equation (3), so we can plug (6) into (3) and get:

$$ACE = \mathbf{E}[y^j \mid T^j = 1] - \mathbf{E}[y^j \mid T^j = 0] = \beta^\star \cdot 1 + c^j - \beta^\star \cdot 0 - c^j = \beta^\star \tag{7}$$

This means that the coefficient corresponding to the treatment variable is in fact the average causal effect,³ which is kinda neat!
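We can check result (7) on the same simulated setup as in the previous sketch: the naive difference of group means is biased by the confounder, while the treatment coefficient from the adjusted regression recovers the assumed effect of 2.0:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)

# Same confounded setup as before; the true causal effect is 2.0.
n = 2000
confounder = rng.normal(size=n)
treatment = (rng.normal(size=n) + confounder > 0).astype(float)
y = 2.0 * treatment + 1.5 * confounder + rng.normal(size=n)

# Naive difference of group means: biased upwards, because high-confounder
# (and therefore high-outcome) patients are more likely to be treated.
print(y[treatment == 1].mean() - y[treatment == 0].mean())  # noticeably > 2.0

# Treatment coefficient beta* from the adjusted regression, as in (7).
X = sm.add_constant(np.column_stack([treatment, confounder]))
print(sm.OLS(y, X).fit().params[1])  # close to 2.0
```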

Things to Watch Out For

There are a couple of things we left unexplained here. One of them is: what happens if the linearity assumption is seriously violated and our model has a very high error?

Having the coefficient equal to the causal effect also allows us to use the standard error of the coefficient as the standard error of the measured causal effect. The thing is, the squared standard error of the coefficient is proportional to $\sigma^2$, the variance of the residuals. This means that the higher our error is, the bigger the standard error of the average causal effect, and therefore the less certainty we have in its exact value. When comparing two treatments, this can be crucial, as a larger standard error means less statistical power, i.e. we are less likely to show a significant difference between treatments. So if our experiment was expensive to perform, a violated linearity assumption can put us in a tough position: we might have wasted a lot of money and still not have conclusive evidence.
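A small simulation can make this concrete: fitting the same model at increasing noise levels shows the standard error of the treatment coefficient growing with the residual variance. The noise levels and the true effect of 2.0 are arbitrary choices for illustration:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)

# A simple randomized binary treatment with an assumed true effect of 2.0.
n = 500
treatment = rng.integers(0, 2, size=n).astype(float)
X = sm.add_constant(treatment)

# As the residual noise sigma grows, so does the coefficient's standard error.
for sigma in (0.5, 2.0, 8.0):
    y = 2.0 * treatment + rng.normal(scale=sigma, size=n)
    fit = sm.OLS(y, X).fit()
    print(f"sigma={sigma}: beta*={fit.params[1]:.2f}, SE={fit.bse[1]:.3f}")
```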

Another small assumption we sneaked into this text is hidden in this sentence:

If we trust our linear model, the causal question can be rephrased as: what outcome would we get if we changed $T$ but held everything else fixed?

Keep in mind that in certain cases it's impossible to change $T$ while keeping everything else fixed, because it is very strongly correlated with some of the other variables. This phenomenon is called multicollinearity, and it can famously screw up the linear coefficients. But note that the variables we include along with the treatment are supposed to be confounders, not all variables we have available. Our goal is not predictive power, but a measure of the causal effect of the treatment.
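One common way to spot multicollinearity is the variance inflation factor (VIF). Here is a sketch on simulated data where the treatment is nearly a linear function of another predictor; the correlation strength is an arbitrary choice for the example:

```python
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(6)

# A treatment that is almost a linear function of another variable z.
n = 1000
z = rng.normal(size=n)
treatment = z + rng.normal(scale=0.05, size=n)  # nearly collinear with z
X = np.column_stack([np.ones(n), treatment, z])

# VIF per predictor; a common rule of thumb flags values above ~10.
for idx, name in [(1, "treatment"), (2, "z")]:
    print(name, variance_inflation_factor(X, idx))
```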

On the other hand, as we said above, predictive power increases statistical power, so it comes down to a trade-off between having an accurate model by including more useful variables - which will reduce the uncertainty in the estimate of the causal effect - and having an accurate measure of causal effect by lowering multicollinearity. It is something we have to use our domain knowledge for, and there are tools in the field of causal inference that can help us with this, but that's a topic for another article.

Conclusion

In any case, linear regression is still a cool model even today, and when used in the context of causal inference, it can provide a lot of value. I hope this article was helpful in putting that into perspective.

Thanks for reading!

Footnotes

  1. This is a requirement for the least squares loss function, but the normality assumption can in general be replaced by an assumption of another distribution, with the loss function adjusted accordingly. In any case, some assumption is needed.

  2. It is a good question how likely it is that the normality and i.i.d. assumptions are satisfied when the linearity assumption is not. I don't have an answer for this, but it might be a good research topic for the future.

  3. It's fairly clear how this generalizes to continuous variables: we can replace $T = 0$ with some reference value $T = t$, and $T = 1$ with $T = t + 1$. This would imply that the causal effect of increasing $T$ by one is $\beta^\star$. The exact interpretation depends on our model and other things, like which transformations we have used, but it is in general possible.